This study uses the historical Lahman Baseball Database to analyze
the differences in peak offensive age (measured by OPS) among MLB
players at different fielding positions. By comparing the peak ages of
players across positions, we examine whether certain positions tend to
reach their offensive prime earlier or later.
We also investigate whether height and weight are associated
with peak age and peak OPS, aiming to understand offensive
characteristics across different positions.
The Lahman Database, created by Sean Lahman, contains comprehensive
MLB statistics from 1871 to the present (current version includes data
up to 2023). The database consists of many tables; for this study, the
main tables used are:
| Table Name | Description |
|---|---|
| Batting | Includes batting statistics such as hits, home runs, RBIs, etc. |
| People | Includes player information such as name, birthdate, and physical data. |
| Appearances | Includes information on player fielding positions. |
Example (Shohei Ohtani, People table):
To evaluate offensive performance, we adopt On-base Plus
Slugging (OPS) as the primary indicator. Before calculating
OPS, we briefly introduce its components:
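For reference, the standard definitions are given here; they match the KPI function defined later in this study. The symbol $1B$ (singles) is not a column in the data and is derived as $H - 2B - 3B - HR$:

$$
\mathrm{OBP} = \frac{H + BB + HBP}{AB + BB + HBP + SF}, \qquad
\mathrm{SLG} = \frac{1B + 2 \cdot 2B + 3 \cdot 3B + 4 \cdot HR}{AB}, \qquad
\mathrm{OPS} = \mathrm{OBP} + \mathrm{SLG}
$$

Here $H$ is hits, $BB$ walks, $HBP$ hit-by-pitch, $AB$ at-bats, $SF$ sacrifice flies, and $2B$, $3B$, $HR$ are doubles, triples, and home runs.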
| Position | Code | Description |
|---|---|---|
| Pitcher (P) | 1 | The pitcher not only throws during the defensive inning but also fields. However, pitchers' batting is not considered in this study. |
| Catcher (C) | 2 | Receives pitches and controls defense. |
| First Base (1B) | 3 | Handles many throws from other infielders. |
| Second Base (2B) | 4 | Defends the right side and middle infield. |
| Third Base (3B) | 5 | Defends left side infield; requires strong arm. |
| Shortstop (SS) | 6 | Defends the left side and middle infield. |
| Left Field (LF) | 7 | Covers left outfield; strong arm needed. |
| Center Field (CF) | 8 | Covers middle outfield; requires speed and range. |
| Right Field (RF) | 9 | Covers right outfield; also requires strong arm. |
| Designated Hitter (DH) | - | Only hits, no fielding responsibilities. |
With the rise of sports science and analytics, performance prediction
and strategy analysis have become popular in professional baseball. This
inspired me to apply R programming to baseball data, combining my
academic learning with personal passion for baseball.
Before analysis, we define a KPI function to compute players' age,
OBP, SLG, and OPS.
Note: For age calculation, players born in July or later are treated as
born in the following year.
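As a minimal sketch of this age rule, the helper below (a hypothetical `season_age` function, not part of the original analysis) applies the same July cutoff to single values in base R:

```r
# Hypothetical helper illustrating the age rule described above.
# Births in July or later are counted toward the following year.
season_age <- function(year_id, birth_year, birth_month) {
  adjusted_birth_year <- ifelse(birth_month >= 7, birth_year + 1, birth_year)
  year_id - adjusted_birth_year
}

# Two illustrative players born in the same calendar year
season_age(2021, 1994, 7)  # born in July: treated as born in 1995
season_age(2021, 1994, 6)  # born in June: birth year unchanged
```

The two calls differ by one year even though both players were born in 1994, which is the intended effect of the cutoff.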
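KPI function:
```r
# Assumes the tables come from the Lahman R package
library(Lahman)  # Batting, People, Appearances
library(dplyr)

# Season-level age, OBP, SLG, and OPS for one player
KPI <- function(player_id) {
  player_batting <- Batting |> filter(playerID == player_id)
  player_info <- People |> filter(playerID == player_id)
  merged_data <- merge(player_batting, player_info, by = "playerID")
  merged_data |>
    mutate(
      # Births in July or later count toward the following year
      birthyear = if_else(birthMonth >= 7, birthYear + 1, birthYear),
      Age = yearID - birthyear,
      OBP = round((H + BB + HBP) / (AB + BB + HBP + SF), 3),
      # Total bases (singles + 2 * doubles + 3 * triples + 4 * home runs) per at-bat
      SLG = round((H - X2B - X3B - HR + 2 * X2B + 3 * X3B + 4 * HR) / AB, 3),
      OPS = round(SLG + OBP, 3)
    ) |>
    select(Age, OBP, SLG, OPS)
}
```
Example with Shohei Ohtani: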
Next, we define a Position function to add each player's primary
fielding position, so that we can further analyze offensive performance
by position.
The logic here is to sum the games played at each fielding position
across all of the player's seasons in the Appearances table, and
then assign the position with the highest total as the player's main
position for his whole career.
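The counting logic can be sketched on a small made-up table. The data frame `appearances_toy` and its values are invented for illustration; only the column naming follows the Appearances table:

```r
# Toy appearances for one hypothetical player (values are invented)
appearances_toy <- data.frame(
  yearID = c(2001, 2002, 2003),
  G_2b = c(90, 10, 0),
  G_ss = c(20, 120, 140),
  G_3b = c(0, 5, 10)
)
pos_cols <- c("G_2b", "G_ss", "G_3b")
pos_names <- c("2B", "SS", "3B")

# Career totals per position, then the position with the most games
career_games <- colSums(appearances_toy[pos_cols], na.rm = TRUE)
main_pos <- pos_names[which.max(career_games)]

# For comparison: the most-played position within each single season
yearly_pos <- pos_names[apply(appearances_toy[pos_cols], 1, which.max)]
```

In this toy case the player's most-played position in 2001 is 2B, but the career totals assign SS. A single career-level label therefore hides position changes over time.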
```r
# Primary fielding position of one player over his whole career
Position <- function(player_id) {
  pos_cols <- c("G_c", "G_1b", "G_2b", "G_3b",
                "G_ss", "G_lf", "G_cf", "G_rf", "G_dh")
  pos_names <- c("C", "1B", "2B", "3B",
                 "SS", "LF", "CF", "RF", "DH")
  appearances <- Appearances |>
    filter(playerID == player_id)
  # Games at each position, summed over all seasons
  total_games <- colSums(appearances[pos_cols], na.rm = TRUE)
  main_pos <- pos_names[which.max(total_games)]
  main_pos
}
```
Again, let's take Shohei Ohtani as an example to check the position information:
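Next, we apply the KPI and Position functions to every hitter with at
least one season of 100 or more at-bats, which excludes players who
rarely batted (e.g., pitchers or players sidelined by injury).
Note that this filter selects players rather than seasons, so all
seasons of a qualifying player are kept.
We then organize the results into one large dataset.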
```r
# IDs of players with at least one season of 100+ at-bats
players <- Batting |>
  filter(AB >= 100) |>
  distinct(playerID) |>
  pull(playerID)

# One row per player-season, labeled with the player's main position
batting_data <- lapply(players, function(pid) {
  kpi_stats <- KPI(pid)
  pos <- Position(pid)
  kpi_stats |>
    mutate(playerID = pid, Position = pos) |>
    select(playerID, Age, OBP, SLG, OPS, Position)
}) |>
  bind_rows()
```
We print the first few rows to check the results:
Finally, we add the player attributes height and weight.
```r
# Height and weight for every player in People
info_data <- People |>
  select(playerID, height, weight)
batting_data <- merge(batting_data, info_data, by = "playerID")
```
We again print the first few rows to check:
(Height and Weight units are inches and pounds,
respectively.)
In addition, we extract each player's peak season (the year with the
highest OPS) into a new dataset for further analysis.
```r
batting_topOPS <- batting_data |>
  group_by(playerID) |>
  filter(OPS == max(OPS, na.rm = TRUE)) |>
  slice(1) |>  # keep a single row if several seasons tie for the maximum
  ungroup()
```
After this data processing, we now have two datasets:
- batting_data: annual statistics of hitters
- batting_topOPS: only the peak season (highest OPS year) for each hitter

These datasets will be used for subsequent analysis.
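To make the relationship between the two datasets concrete, here is a minimal base-R sketch on invented data. The data frame `seasons_toy` is hypothetical, and `which.max` is used here in place of the dplyr steps above:

```r
# Toy season-level data for two hypothetical players (values are invented)
seasons_toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  Age = c(24, 27, 30, 26, 29),
  OPS = c(0.700, 0.850, 0.790, NA, 0.720)
)

# Keep the single best-OPS row per player; which.max() skips NA values
# and returns the first maximum when several seasons tie
peak_toy <- do.call(rbind, lapply(
  split(seasons_toy, seasons_toy$playerID),
  function(d) d[which.max(d$OPS), ]
))
peak_toy
```

Player "a" keeps the age-27 season and player "b" keeps the age-29 season, so the season-level table has five rows and the peak-season table has two.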
First, let's look at scatterplots of OPS versus Age for several
players.
Since Shohei Ohtani is still an active player with relatively few data
points, we instead select historically well-known players from different
positions for illustration.
(Here we choose: Right Fielder (RF) Hank Aaron, Catcher (C) Mike Piazza,
Shortstop (SS) Derek Jeter, and Designated Hitter (DH) Edgar Martinez.)
```r
library(ggplot2)
library(patchwork)  # combines plots with + and /

# Scatterplot of OPS by age for a single player
plot_player <- function(id, title) {
  batting_data |>
    filter(playerID == id) |>
    ggplot() +
    geom_point(aes(Age, OPS)) +
    scale_x_continuous(limits = c(20, 45)) +
    labs(title = title, x = "Age", y = "OPS") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))
}

aaronha <- plot_player("aaronha01", "Hank Aaron (RF)")
piazzmi <- plot_player("piazzmi01", "Mike Piazza (C)")
jeterde <- plot_player("jeterde01", "Derek Jeter (SS)")
martied <- plot_player("martied01", "Edgar Martinez (DH)")

(aaronha + piazzmi) / (jeterde + martied) +
  plot_annotation(
    title = "OPS by Age",
    theme = theme(plot.title = element_text(hjust = 0.5, size = 18))
  )
```
From the plots, we can see that the age of peak OPS differs considerably
between these players. For example, Hank Aaron reached his highest OPS at
age 37, while Derek Jeter peaked much earlier at age 25.
We can also observe that players' career trajectories generally form a
parabolic shape (similar to a quadratic regression curve). We will now
analyze and test this pattern.
Next, we fit a quadratic regression of OPS on Age to test whether
players' OPS trajectories follow this shape:
```
Call:
lm(formula = OPS ~ Age + I(Age^2), data = batting_data)

Residuals:
    Min      1Q  Median      3Q     Max
-0.6840 -0.0824  0.0194  0.1086  4.3184

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.404e-01  3.671e-02   3.824 0.000131 ***
Age          3.504e-02  2.531e-03  13.845  < 2e-16 ***
I(Age^2)    -5.648e-04  4.304e-05 -13.123  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2008 on 37257 degrees of freedom
  (24518 observations deleted due to missingness)
Multiple R-squared:  0.006361,  Adjusted R-squared:  0.006308
F-statistic: 119.3 on 2 and 37257 DF,  p-value: < 2.2e-16
```
| Variable | Estimate | P-value | Adjusted R² | Interpretation |
|---|---|---|---|---|
| Age | 0.035 | < 2e-16 | 0.0063 | The linear term is positive, so OPS rises with age early in a career. Because the model also contains a quadratic term, this is not a constant per-year effect. |
| I(Age²) | -0.00056 | < 2e-16 | 0.0063 | Quadratic effect: OPS increases first and then decreases, implying the existence of a peak period. |
We then examine the residual plots of this model to check whether the
assumptions of quadratic regression hold:
Although the model shows a statistically significant quadratic
relationship between Age and OPS, it explains less than 1% of the
variance in OPS (adjusted R² = 0.0063), and the diagnostic plots reveal
skewness and heteroskedasticity in the residuals. The prediction errors
are therefore not fully random, so the results should be interpreted
with caution.
Next, we draw a quadratic regression curve to visualize the
relationship between Age and OPS:
```r
batting_data |>
  ggplot() +
  # linewidth requires ggplot2 >= 3.4.0; use size on older versions
  geom_smooth(aes(Age, OPS), method = "lm", formula = y ~ x + I(x^2), linewidth = 1.5) +
  theme_bw() +
  labs(title = "Age vs OPS", x = "Age", y = "OPS") +
  theme(plot.title = element_text(hjust = 0.5, size = 18))
```
```
  (Intercept)           Age      I(Age^2)
 0.1403716751  0.0350429541 -0.0005647756
```
```r
# `model` is the quadratic fit summarized above:
# lm(OPS ~ Age + I(Age^2), data = batting_data)
a <- model$coefficients["I(Age^2)"]
b <- model$coefficients["Age"]
c <- model$coefficients["(Intercept)"]
# Vertex of the parabola a * Age^2 + b * Age + c
vertex_x <- -b / (2 * a)
vertex_y <- a * vertex_x^2 + b * vertex_x + c
print(paste("Vertex: Age =", round(vertex_x), ", OPS =", round(vertex_y, 3)))
```
```
[1] "Vertex: Age = 31 , OPS = 0.684"
```
We can see that hitters reach their maximum OPS of about 0.684 at around age 31.
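The peak can also be checked by hand from the fitted coefficients. Writing the fitted curve as $\mathrm{OPS} = a\,\mathrm{Age}^2 + b\,\mathrm{Age} + c$ with $a = -0.0005648$, $b = 0.0350430$, and $c = 0.1403717$, setting the slope to zero gives:

$$
\frac{d\,\mathrm{OPS}}{d\,\mathrm{Age}} = b + 2a\,\mathrm{Age} = 0
\quad\Longrightarrow\quad
\mathrm{Age}^{*} = -\frac{b}{2a} = \frac{0.0350430}{2 \times 0.0005648} \approx 31.0
$$

$$
\mathrm{OPS}^{*} = c - \frac{b^{2}}{4a} = 0.1403717 + \frac{0.0350430^{2}}{4 \times 0.0005648} \approx 0.684
$$

The slope $b + 2a\,\mathrm{Age}$ is positive before age 31 and negative after it, which is why the Age coefficient alone should not be read as a constant yearly gain.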
Now let's further examine how this Age-OPS relationship differs by
fielding position:
```r
batting_data |>
  ggplot() +
  # linewidth requires ggplot2 >= 3.4.0; use size on older versions
  geom_smooth(aes(Age, OPS), method = "lm", formula = y ~ x + I(x^2), linewidth = 1.5) +
  scale_x_continuous(limits = c(18, 45)) +
  facet_wrap(~ Position, ncol = 3) +
  theme_bw() +
  labs(title = "Age by Fielding Position vs OPS", x = "Age", y = "OPS") +
  theme(plot.title = element_text(hjust = 0.5, size = 18))
```
From these facet plots, we can see that OPS versus Age follows a
parabolic trend across all fielding positions, but the exact peak ages
and OPS levels differ by position.
Here, we restrict the x-axis (Age) to 18-45 years to avoid distortion
from a few extreme outliers (very young or very old players with few
appearances).
Next, we will explore whether this variation across positions is related
to players' height and weight.
It is important to note that the quadratic curves above are based on all
annual data across all players, so they represent the average OPS at
each age; they do not imply that every player peaks exactly at age 31.
Therefore, in the following analysis, we switch to the batting_topOPS
dataset, which retains only each player's best OPS season, to study peak
ages more directly by position.
Next, let's look at the scatter plot of players' height and
weight:
```r
batting_topOPS |>
  ggplot() +
  geom_point(aes(weight, height)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 18)) +
  labs(title = "Scatter Plot of MLB Hitters' Weight and Height", x = "Weight", y = "Height")
```
We can generally see that players' heights are concentrated in the
range of roughly 67 to 77 inches, and their weights are mostly within
150 to 225 lbs.
Players can also roughly be divided into two categories: a
taller/heavier type and a shorter/lighter type.
Next, let's examine the scatter plot of height and weight by fielding
position:
```r
batting_topOPS |>
  ggplot() +
  geom_point(aes(weight, height)) +
  facet_wrap(~ Position) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 18)) +
  labs(title = "Scatter Plot of MLB Hitters' Weight and Height by Fielding Position", x = "Weight", y = "Height")
```
From the scatterplots, we can see that Designated Hitters (DH) tend
to be taller and heavier, while Catchers (C), Second Basemen (2B), and
Shortstops (SS) are relatively shorter and lighter.
Scatterplots only provide a general sense of the trend, so next we
use sorted boxplots to gain a clearer view:
```r
# Positions are ordered by their median height (top) and median weight (bottom)
p_height <- batting_topOPS |>
  ggplot() +
  geom_boxplot(aes(reorder(Position, height, FUN = median), height)) +
  labs(x = NULL)
p_weight <- batting_topOPS |>
  ggplot() +
  geom_boxplot(aes(reorder(Position, weight, FUN = median), weight)) +
  labs(x = "Position")
p_height / p_weight +
  plot_annotation(
    title = "Height and Weight Box Plot by Fielding Position",
    theme = theme(plot.title = element_text(hjust = 0.5, size = 18))
  )
```
We can observe that Designated Hitters (DH) and First Basemen (1B)
have higher medians for both height and weight.
In contrast, positions requiring speed and agility, such as Second Base
(2B), Shortstop (SS), Third Base (3B), and Center Field (CF), have
relatively lower median height and weight.
First, let's look at a table showing the average peak ages (year of
highest OPS) by fielding position:
```r
position_summary <- batting_topOPS |>
  group_by(Position) |>
  summarise(avg_Age = round(mean(Age), 2)) |>
  arrange(desc(avg_Age))
position_summary
```
Deviation of each position's mean peak age from the mean across positions:
```r
batting_topOPS |>
  group_by(Position) |>
  summarise(age0 = mean(Age)) |>
  mutate(
    # Deviation of each position's mean peak age from the mean across positions
    dev = age0 - mean(age0, na.rm = TRUE),
    hjust = ifelse(dev > 0, 1.07, -0.07)
  ) |>
  ggplot() +
  geom_col(aes(reorder(Position, dev), dev, fill = dev > 0), show.legend = FALSE) +
  geom_text(
    aes(x = reorder(Position, dev), y = dev, label = round(dev, 2), hjust = hjust),
    vjust = 0.4,
    size = 3.3,
    color = "white",
    fontface = "bold"
  ) +
  coord_flip() +
  scale_fill_manual(values = c("FALSE" = "red2", "TRUE" = "black")) +
  labs(title = "Max-OPS Age Deviation by Position", x = "Fielding Position", y = "Deviation from Mean Age at Max OPS") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 18),
        axis.text.y = element_text(size = 10))
```
We can see that the four positions with the latest average peak ages are
Designated Hitter (DH), Catcher (C), First Base (1B), and Left Field
(LF), though the actual differences are quite small.
Next, let's examine a heatmap of Height, Weight, and Age:
```r
library(magrittr)  # %$% exposition pipe
library(tidyr)     # complete()

# Ten equally spaced break points (nine bins) per axis; bins are labeled by their midpoints
xbk <- batting_topOPS %$% seq(min(weight, na.rm = TRUE), max(weight, na.rm = TRUE), length.out = 10)
xbk1 <- round(xbk[-1] - diff(xbk)[1] / 2, 1)
ybk <- batting_topOPS %$% seq(min(height, na.rm = TRUE), max(height, na.rm = TRUE), length.out = 10)
ybk1 <- round(ybk[-1] - diff(ybk)[1] / 2, 1)

batting_topOPS |>
  mutate(
    weight_bin = cut(weight, breaks = xbk, labels = xbk1, include.lowest = TRUE),
    height_bin = cut(height, breaks = ybk, labels = ybk1, include.lowest = TRUE)
  ) |>
  group_by(weight_bin, height_bin) |>
  summarise(kpi = mean(Age, na.rm = TRUE), .groups = "drop") |>
  complete(weight_bin, height_bin, fill = list(kpi = NA)) |>
  ggplot(aes(weight_bin, height_bin)) +
  geom_tile(aes(fill = kpi)) +
  geom_text(aes(label = round(kpi, 1)), color = "white", size = 4, hjust = 0.5) +
  scale_fill_gradient(low = "gray", high = "black", na.value = "white") +
  labs(title = "Max-OPS Avg Age by Weight and Height Combination in MLB Hitters", x = "Weight", y = "Height", fill = "Max-OPS Avg Age") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 18),
        axis.text.x = element_text(hjust = 1),
        legend.title = element_text(hjust = 0.5))
```
It can be observed that height and weight do not show a clear
relationship with peak age.
We then run a regression analysis of peak age against height and
weight to test for statistical significance.
```
Call:
lm(formula = Age ~ height + weight, data = batting_topOPS)

Residuals:
    Min      1Q  Median      3Q     Max
-8.9927 -2.5318 -0.4767  2.1320 14.9971

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.250505   2.050601  14.264  < 2e-16 ***
height       0.010230   0.032136   0.318     0.75
weight      -0.014757   0.003178  -4.643 3.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.642 on 4251 degrees of freedom
Multiple R-squared:  0.006819,  Adjusted R-squared:  0.006352
F-statistic: 14.59 on 2 and 4251 DF,  p-value: 4.828e-07
```
| Response | P-value (Height) | P-value (Weight) | Adjusted R² | Interpretation |
|---|---|---|---|---|
| Age | 0.75 | 3.54e-06 | 0.006352 | Only weight shows a significant effect on peak age, but the overall explanatory power of the model is still very weak. |
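To give a sense of scale, the weight coefficient can be applied to an illustrative 50 lb difference between two otherwise identical hitters (the 50 lb figure is an arbitrary choice for this example):

$$
\Delta \mathrm{Age} = \hat{\beta}_{\text{weight}} \times \Delta \text{weight} = -0.014757 \times 50 \approx -0.74 \text{ years}
$$

This is small relative to the residual standard error of 3.642 years, which is consistent with the very low adjusted R².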
We also validate the regression assumptions for the model of peak age
against height and weight:
Overall, the assumptions hold reasonably well, with no severe violations
of linear regression requirements.
First, let's look at a table showing the average peak OPS (the
highest OPS season) by fielding position:
```r
OPS_summary <- batting_topOPS |>
  group_by(Position) |>
  summarise(avg_OPS = round(mean(OPS), 3)) |>
  arrange(desc(avg_OPS))
OPS_summary
```
Next, we examine the deviation plot of peak OPS by position:
```r
batting_topOPS |>
  group_by(Position) |>
  summarise(OPS0 = mean(OPS)) |>
  mutate(
    # Deviation of each position's mean peak OPS from the mean across positions
    dev = OPS0 - mean(OPS0, na.rm = TRUE),
    hjust = ifelse(dev > 0, 1.0, -0.07)
  ) |>
  ggplot() +
  geom_col(aes(reorder(Position, dev), dev, fill = dev > 0), show.legend = FALSE) +
  geom_text(
    aes(x = reorder(Position, dev), y = dev, label = round(dev, 2), hjust = hjust),
    vjust = 0.4,
    size = 3,
    color = "white",
    fontface = "bold"
  ) +
  coord_flip() +
  scale_fill_manual(values = c("FALSE" = "red2", "TRUE" = "black")) +
  labs(title = "Max OPS Deviation by Position", x = "Fielding Position", y = "Deviation from Mean Max OPS") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 18),
        axis.text.y = element_text(size = 10))
```
We can see that the top two positions in peak OPS are Designated Hitter
(DH) and First Base (1B), while the lowest two are Shortstop (SS) and
Second Base (2B).
Next, let's explore the heatmap of players' Height, Weight, and
OPS:
```r
# Uses the bin breaks (xbk, ybk) and midpoint labels (xbk1, ybk1) defined for the age heatmap
batting_topOPS |>
  mutate(
    weight_bin = cut(weight, breaks = xbk, labels = xbk1, include.lowest = TRUE),
    height_bin = cut(height, breaks = ybk, labels = ybk1, include.lowest = TRUE)
  ) |>
  group_by(weight_bin, height_bin) |>
  summarise(kpi = mean(OPS, na.rm = TRUE), .groups = "drop") |>
  complete(weight_bin, height_bin, fill = list(kpi = NA)) |>
  ggplot(aes(weight_bin, height_bin)) +
  geom_tile(aes(fill = kpi)) +
  geom_text(aes(label = round(kpi, 3)), color = "white", size = 4) +
  scale_fill_gradient(low = "gray", high = "black", na.value = "white") +
  labs(title = "Max-OPS Age Avg OPS by Weight and Height Combination in MLB Hitters", x = "Weight", y = "Height", fill = "Max-OPS Age Avg OPS") +
  theme_bw() +
  theme(plot.title = element_text(hjust = -1, size = 18),
        axis.text.x = element_text(hjust = 1),
        legend.title = element_text(hjust = 0.5))
```
Compared with the peak age analysis, the OPS heatmap reveals a
clearer trend: taller and heavier players tend to have higher OPS.
Designated Hitters (DH) and First Basemen (1B) are generally taller and
heavier, which matches the deviation plot where they also showed higher
peak OPS values.
After noticing this interesting pattern, we further investigate which
component of OPS, On-base Percentage (OBP) or Slugging Percentage (SLG),
contributes more to this difference.
Below are the heatmaps for OBP and SLG:
```r
# Heatmap of the mean of `var` (a column name given as a string) per weight/height bin.
# Uses the bin breaks (xbk, ybk) and midpoint labels (xbk1, ybk1) defined earlier.
bin_heatmap <- function(var) {
  batting_topOPS |>
    mutate(
      weight_bin = cut(weight, breaks = xbk, labels = xbk1, include.lowest = TRUE),
      height_bin = cut(height, breaks = ybk, labels = ybk1, include.lowest = TRUE)
    ) |>
    group_by(weight_bin, height_bin) |>
    summarise(kpi = mean(.data[[var]], na.rm = TRUE), .groups = "drop") |>
    complete(weight_bin, height_bin, fill = list(kpi = NA)) |>
    ggplot(aes(weight_bin, height_bin)) +
    geom_tile(aes(fill = kpi)) +
    geom_text(aes(label = round(kpi, 3)), color = "white", size = 3.5) +
    scale_fill_gradient(low = "gray", high = "black", na.value = "white") +
    ggtitle(var) +
    labs(x = "Weight", y = "Height", fill = paste("Max-OPS Age Avg", var)) +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5, size = 14),
          axis.text.x = element_text(hjust = 1),
          legend.title = element_text(hjust = 0.5))
}

OBP_plot <- bin_heatmap("OBP")
SLG_plot <- bin_heatmap("SLG")

OBP_plot / SLG_plot +
  plot_annotation(
    title = "Max-OPS Age Avg OBP and SLG by Weight and Height Combination in MLB Hitters",
    theme = theme(plot.title = element_text(hjust = 0.5, size = 18))
  )
```
We can see that body size (height and weight) has a noticeably stronger
association with SLG, while OBP shows no clear differences.
We then run regression analyses of peak OPS against height and weight
to verify the relationship.
OPS:
```
Call:
lm(formula = OPS ~ height + weight, data = batting_topOPS)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8663 -0.1192 -0.0321  0.0606  4.1279

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2083722  0.1351950   1.541  0.12333
height      0.0058123  0.0021187   2.743  0.00611 **
weight      0.0011682  0.0002096   5.575 2.63e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2401 on 4251 degrees of freedom
Multiple R-squared:  0.01879,  Adjusted R-squared:  0.01832
F-statistic: 40.69 on 2 and 4251 DF,  p-value: < 2.2e-16
```
OBP:
```
Call:
lm(formula = OBP ~ height + weight, data = batting_topOPS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.37018 -0.04483 -0.01229  0.02403  0.63432

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.258e-01  5.166e-02   6.307 3.14e-10 ***
height       6.555e-04  8.095e-04   0.810    0.418
weight      -2.105e-05  8.007e-05  -0.263    0.793
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09175 on 4251 degrees of freedom
Multiple R-squared:  0.0001668,  Adjusted R-squared:  -0.0003036
F-statistic: 0.3546 on 2 and 4251 DF,  p-value: 0.7015
```
SLG:
```
Call:
lm(formula = SLG ~ height + weight, data = batting_topOPS)

Residuals:
    Min      1Q  Median      3Q     Max
-0.5020 -0.0857 -0.0239  0.0430  3.4995

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1174070  0.0977304  -1.201 0.229688
height       0.0051568  0.0015316   3.367 0.000767 ***
weight       0.0011892  0.0001515   7.851  5.2e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1736 on 4251 degrees of freedom
Multiple R-squared:  0.03404,  Adjusted R-squared:  0.03358
F-statistic: 74.9 on 2 and 4251 DF,  p-value: < 2.2e-16
```
| Response | P-value (Height) | P-value (Weight) | Adjusted R² | Interpretation |
|---|---|---|---|---|
| OPS | 0.0061 | 2.63e-08 | 0.0183 | Both height and weight have significant effects on OPS, but the overall explanatory power remains weak. |
| OBP | 0.418 | 0.793 | -0.0003 | No significance for either variable; predictive power is essentially zero. |
| SLG | 0.00077 | 5.2e-15 | 0.0336 | Both variables are significant, but the predictive power is still weak. |
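To illustrate what "significant but weak" means in practice, the sketch below plugs the reported OPS coefficients into a small prediction helper. The function name `predict_peak_ops` and the two body sizes are hypothetical choices for this example, not part of the original analysis:

```r
# Coefficients copied from the OPS ~ height + weight output above
ops_coefs <- c(intercept = 0.2083722, height = 0.0058123, weight = 0.0011682)

# Hypothetical helper: fitted peak OPS for a given height (in) and weight (lb)
predict_peak_ops <- function(height_in, weight_lb) {
  unname(ops_coefs["intercept"] +
           ops_coefs["height"] * height_in +
           ops_coefs["weight"] * weight_lb)
}

smaller_hitter <- predict_peak_ops(70, 170)  # illustrative 70 in, 170 lb
larger_hitter <- predict_peak_ops(76, 230)   # illustrative 76 in, 230 lb
ops_gap <- larger_hitter - smaller_hitter
round(c(smaller_hitter, larger_hitter, ops_gap), 3)
```

Even for two quite different builds, the fitted gap is well under half of the residual standard error of 0.2401, so body size shifts the expected peak OPS but says little about any individual hitter.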
We also validate the regression assumptions for the OPS model against
height and weight:
Overall, the model does not show major violations of the basic
assumptions of linear regression.
| Analysis Item | Result | Explanation |
|---|---|---|
| Peak Age | Average ~28.5 years | The differences in peak age across positions are small, with an overall average around 28.5 years. |
| Peak OPS | Average ~0.85 | The differences in peak OPS across positions are larger. The top two are Designated Hitter (DH) and First Base (1B), while the bottom two are Shortstop (SS) and Second Base (2B). |
| Impact of Height/Weight on Peak Age | Height: not significant; Weight: slight negative effect | Height does not significantly affect peak age, but heavier players tend to peak slightly earlier. |
| Impact of Height/Weight on Peak OPS | Both positively correlated | Height and weight are significantly positively related to peak OPS. However, the effect is primarily driven by Slugging Percentage (SLG) rather than On-base Percentage (OBP). |
Although the batting_topOPS dataset shows only small differences in peak
age across positions, the earlier Age-OPS plots using batting_data
reveal that players at less defensively demanding positions such as
Designated Hitter (DH) and First Base (1B) tend to reach their OPS peak
later.
This difference likely occurs because star hitters (whose OPS is
consistently above average) often transition to DH or 1B in the later
stages of their careers. This raises the OPS-age curve for those
positions, making their peak age appear later.
For example, an outfielder who declines defensively but retains strong
offensive ability is often moved to DH or 1B to extend his career. As a
result, these positions include more older players who still maintain
high OPS, causing the estimated peak age from batting_data to shift
upward compared with other positions.